Project-Team:STARS

Inria | Raweb 2013 | Presentation of the Project-Team STARS | STARS Web Site


	PDF	e-Pub

Previous |

Home | Next next

Section: New Results

Tracking and Video Representation

Participants : Ratnesh Kumar, Guillaume Charpiat, Monique Thonnat.

keywords: Fibers, Graph Partitioning, Message Passing, Iterative Conditional Modes, Video Segmentation, Video Inpainting

Multiple Object Tracking The objective is to find trajectories of objects (belonging to a particular category) in a video. To find possible occupancy locations, an object detector is applied to all frames of a video, yielding bounding boxes. Detectors are not perfect and may provide false detections; they may also miss objects sometimes. We build a graph of all detections, and aim at partitioning the graph into object trajectories. Edges in the graph encode factors between detections, based on the following :

Number of common point tracks between bounding boxes (the tracks are obtained from an optical-flow-based point tracker)
Global appearance similarity (based on the pixel colors inside the bounding boxes)
Trajectory straightness : for three bounding boxes at different frames, we compute the Laplacian (centered at the middle frame) of the centroids of the boxes.
Repulsive constraint : Two detections in a same frame cannot belong to the same trajectory.

We compute the partitions by using sequential tree re-weighted message passing (TRW-S). To avoid local minima, we use a label flipper motivated from the Iterative Conditional Modes algorithm.

We apply our approach to typical surveillance videos where object of interest are humans. Comparative quantitative results can be seen in Tables 1 and 2 for two videos. The evaluation metrics considered are : Recall, Precision, Average False Alarms Per Frame (FAF), Number of Groundtruth Trajectories (GT), Number of Mostly Tracked Trajectories, Number of Fragments (Frag), Number of Identity Switches (IDS), Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP).

This work has been submitted to CVPR' 14.

**Table 1.** Towncenter Video Output
Method	MOTA	MOTP	Detector
[59] (450-750)	56.8	79.6	HOG
Ours (450-750)	53.5	69.1	HOG

**Table 2.** Comparison with recent proposed approaches on PETS S2L1 Video
Method	Recall	Precision	FAF	GT	MT	Frag	IDS
[77]	96.9	94.1	0.36	19	18	15	22
Ours	95.4	93.4	0.28	19	18	42	13

Video Representation We continued our work from the previous year on Fiber-Based Video Representation. During this year we focused on obtaining competitive results with the state-of-the-art (Figure 13 ).

Figure 13. Top Row: Left image displays a sequence as a volumetric display. Right image displays all fibers found, clustered at a particular hierarchy. Bottom Row : Left Image displays the highest level of the hierarchical clustering, with fiber extension. Right Image shows the result obtained from [71] . Our result demonstrates better long term temporal coherency.

The usefulness of our novel representation is demonstrated by a simple video inpainting task. Here a user input of only 7 clicks is required to remove the dancing girl disturbing the news reporter (Figure 14 ).

Figure 14. Inpainting task. Left : Original video (top) and

x t

slice (bottom) showing trajectories. Right : Our result. Clusters of fibers were computed and selected with only 7 mouse clicks to distinguish the disturbing girl from the reporter and background. The girl was removed and the hole was filled by extending the background fibers in time.

This work has been accepted for publication next year [41] .

Previous |

Home | Next next